Probabilistic Methods for Structured Document Classification at INEX'07
Authors
Abstract
This paper presents the results of our participation in the Document Mining track at INEX’07. We focused on the task of classifying XML documents. To deal with structured document representations, our approach applies plain-text classification methods to flattened versions of the documents, in which some of their structural properties have been translated into plain text. We explored several options for converting structured documents into flat documents, in combination with two probabilistic methods for text categorization. The main conclusion of our experiments is that taking advantage of document structure to improve classification results is a difficult task.
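As an illustration of what "flattening" can mean here, the following is a minimal sketch, not the authors' implementation. It assumes two hypothetical conversion modes: "text", which discards all markup, and "tagged", which prefixes each word with the name of its enclosing element so that a plain-text classifier sees the same word under different tags as different features. The function name flatten, the mode names, and the sample document are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text, mode="text"):
    """Turn an XML document into a flat, space-separated string of tokens.

    mode="text":   keep only the words, discarding every tag.
    mode="tagged": prefix each word with the tag of its enclosing element
                   (hypothetical scheme, chosen only for illustration).
    """
    if mode not in ("text", "tagged"):
        raise ValueError(f"unknown mode: {mode!r}")

    tokens = []

    def emit(text, tag):
        # element.text / element.tail may be None, hence the `or ""`.
        for word in (text or "").split():
            tokens.append(f"{tag}_{word}" if mode == "tagged" else word)

    def visit(element):
        # Text placed directly inside the element, before its first child.
        emit(element.text, element.tag)
        for child in element:
            visit(child)
            # Tail text follows the child but belongs to the current element.
            emit(child.tail, element.tag)

    visit(ET.fromstring(xml_text))
    return " ".join(tokens)


if __name__ == "__main__":
    doc = (
        "<article><title>Bayesian networks</title>"
        "<body>Networks for <b>text</b> classification</body></article>"
    )
    print(flatten(doc, "text"))
    # Bayesian networks Networks for text classification
    print(flatten(doc, "tagged"))
    # title_Bayesian title_networks body_Networks body_for b_text body_classification
```

The flat strings produced this way could then be passed to any bag-of-words text classifier; which conversions and classifiers the paper actually evaluated is not specified in this abstract.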
Similar resources
Probabilistic Methods for Link-Based Classification at INEX 2008
In this paper we propose a new method for link-based classification using Bayesian networks. It can be combined with any content-only probabilistic classifier, which makes it usable with several different classifiers. We also report the results obtained from its application to the XML Document Mining Track of INEX’08.
Link-Based Text Classification Using Bayesian Networks
In this paper we propose a new methodology for link-based document classification based on probabilistic classifiers and Bayesian networks. We also report the results obtained from its application to the XML Document Mining Track of INEX’09.
INEX 2005 Multimedia Track
This paper reports on the activities of the INEX 2005 Multimedia track. The track was successful in realizing its objective to provide a pilot evaluation platform for the evaluation of retrieval strategies for XML-based multimedia documents. In this first exploratory year the focus of the evaluation experiment was to test approaches for the retrieval of XML fragments using a combination of cont...
Cheshire II at INEX: Using a Hybrid Logistic Regression and Boolean Model for XML Retrieval
This paper describes the retrieval approach that Berkeley used in the INEX evaluation. The primary approach is the combination of a probabilistic method using a logistic regression algorithm for estimation of collection relevance and element relevance, along with Boolean constraints. The paper also discusses our approach to XML component retrieval and how component and document retrieval are c...
Cheshire II at INEX ’03: Component and Algorithm Fusion for XML Retrieval
This paper describes the retrieval approach that UC Berkeley used in the 2003 INEX evaluation. As in last year’s INEX, our primary approach is the combination of a probabilistic method using a logistic regression algorithm for estimation of document (article) relevance and/or element relevance, along with Boolean constraints. This year we also used data fusion techniques to combine results fro...